Syntactic indexing: enqueuer and scheduler #62485

keynmol · 2024-05-07T11:35:25Z

Fixes GRAPH-124
Fixes GRAPH-121

This PR introduces three main components:

Enqueuer – a thin layer responsible for actually inserting the syntactic indexing records into the database.
Scheduler – a service that:
- Identifies repositories that haven't been processed in a while
- Identifies policies that match those repositories (policies that have syntactic indexing enabled)
- Identifies commits that match any of the policies
- And finally, enqueues the jobs to index the discovered repositories and commits
Scheduler job – a periodic routine that triggers the Scheduler at a specified interval. This job runs as part of the main Worker service and only schedules jobs if the experimental syntactic indexing feature is enabled.

Refactoring:

- Making some methods public in policies.Service to make it easier to test logic that depends on glob matching of repository names (this matching requires separate state to be updated).
- Extracting some test utilities into a separate package.

TODO:

~~[ ] Tests for policy iterator~~ - we're currently investigating whether policy iterator is needed at all, as pagination of policies is in question.
Tests for Scheduler
Wire in experimental feature flag

Test plan

New tests for all components

internal/codeintel/syntactic_indexing/scheduler_test.go

github-actions · 2024-05-23T11:05:42Z

Caution

License checking failed, please read: how to deal with third parties licensing.

github-actions · 2024-05-23T11:32:33Z

Caution

License checking failed, please read: how to deal with third parties licensing.

varungandhi-src

Left some initial comments; will take a deeper look at more of the code shortly.

cmd/worker/shared/init/db/db.go

internal/codeintel/syntactic_indexing/enqueuer.go

internal/codeintel/syntactic_indexing/internal/policy_iterator.go

internal/codeintel/syntactic_indexing/jobstore/store.go

internal/codeintel/syntactic_indexing/scheduler_job.go

internal/codeintel/syntactic_indexing/scheduler_config.go

internal/codeintel/syntactic_indexing/scheduler.go

internal/codeintel/syntactic_indexing/scheduler_test.go

varungandhi-src

Suggestions

- (Strong) Let's avoid doing the repo check again inside the doubly nested loop.
- (Weak) Let's avoid doing the commit check multiple times.
- (Weak) Let's avoid revisiting the same commit multiple times per repo.
- (Weak) Let's simplify the env vars. We can make it more complicated later.
- (Strong) Please clarify the policies array elements in scheduler_test.go.

Would prefer Strong suggestions to be addressed pre-merge.

Thanks for your patience. It's been a bit hard for me to wrap my head around all the logic here, so this took longer to review than anticipated.

keynmol · 2024-05-30T13:33:39Z

internal/codeintel/syntactic_indexing/scheduler.go

+	commitsToSchedule := make(map[api.RepoID]collections.Set[api.CommitID])
+	enqueueOptions := EnqueueOptions{force: false}
+
+	var allErrors errors.MultiError
+
+	for _, repoToIndex := range repos {
+		repo, _ := s.RepoStore.Get(ctx, api.RepoID(repoToIndex.ID))
+		policyIterator := internal.NewPolicyIterator(s.PoliciesService, repoToIndex.ID, internal.SyntacticIndexing, schedulerConfig.PolicyBatchSize)
+		err := policyIterator.ForEachPoliciesBatch(ctx, func(policies []policiesshared.ConfigurationPolicy) error {
+			commitMap, err := s.PolicyMatcher.CommitsDescribedByPolicy(ctx, int(repoToIndex.ID), repo.Name, policies, currentTime)
+
+			if err != nil {
+				return err
+			}
+
+			for commit, policyMatches := range commitMap {
+				if len(policyMatches) == 0 {
+					continue
+				}
+				if commits := commitsToSchedule[repo.ID]; commits != nil {
+					commits.Add(api.CommitID(commit))
+				} else {
+					commitsToSchedule[repo.ID] = collections.NewSet(api.CommitID(commit))
+				}
+			}
+
+			return nil
+		})
+
+		if err != nil {
+			allErrors = errors.Append(allErrors, errors.Newf("Failed to discover commits eligible for syntactic indexing for repo [%s]: %v", repo.Name, err))
+		}
+	}
+
+	for repoId, commits := range commitsToSchedule {
+		for _, commitId := range commits.Values() {
+			if _, err := s.Enqueuer.QueueIndexingJobs(ctx, repoId, commitId, enqueueOptions); err != nil {
+				allErrors = errors.Append(allErrors, errors.Newf("Failed to schedule syntactic indexing of repo [ID=%d], commit [%s]: %v", repoId, commitId, err))
+			}
+		}
+	}


@varungandhi-src I've redone this part. Main highlights:

- Collecting all repos and commits once
- Removing fail-fast behavior in the repo iterator
- Removing fail-fast behavior in the (repo, commit) iterator

Additionally, the enqueuer no longer performs revision checks; we'll leave that to the syntactic worker itself.

This means the scheduler will always do best effort scheduling, while returning all accumulated errors.

I hope my understanding of return allErrors is right here and equivalent to return nil if no allErrors = errors.Append calls ever happened.

varungandhi-src · 2024-05-30

Thanks, this looks much better!

> I hope my understanding of return allErrors is right here and equivalent to return nil if no allErrors = errors.Append calls ever happened.

Yep, we're following this pattern elsewhere too. I don't think you even need to declare the type as MultiError; there are other places where errors.Append takes a plain error as its first argument instead of a MultiError (search for var errs error).
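
To make the nil question above concrete, here is a small standalone sketch of how Go treats an accumulator that was never appended to. It uses only the standard library (errors.Join, Go 1.20+) and two hypothetical types, multiError and listError, invented for this example; it makes no claim about how the errors package used in this PR defines MultiError.

```go
package main

import (
	"errors"
	"fmt"
)

// multiError is a hypothetical interface-typed accumulator.
type multiError interface {
	error
	Errors() []error
}

// listError is a hypothetical concrete accumulator, used only to show the pitfall.
type listError struct{ errs []error }

func (l *listError) Error() string   { return fmt.Sprint(l.errs) }
func (l *listError) Errors() []error { return l.errs }

// plainAccumulator uses `var errs error`: it stays nil unless something is joined.
func plainAccumulator(fail bool) error {
	var errs error
	if fail {
		errs = errors.Join(errs, errors.New("boom"))
	}
	return errs
}

// interfaceAccumulator returns a nil interface value, which converts to a nil error.
func interfaceAccumulator() error {
	var errs multiError
	return errs
}

// pointerAccumulator returns a nil pointer wrapped in a non-nil error interface.
func pointerAccumulator() error {
	var errs *listError
	return errs
}

func main() {
	fmt.Println(plainAccumulator(false) == nil) // true
	fmt.Println(plainAccumulator(true) == nil)  // false
	fmt.Println(interfaceAccumulator() == nil)  // true
	fmt.Println(pointerAccumulator() == nil)    // false: typed nil pointer
}
```

Returning an untouched accumulator is equivalent to return nil only when its static type is an interface; declaring it as plain error sidesteps the typed-nil case entirely.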

keynmol · 2024-05-31T09:24:52Z

I will merge this PR; if there are any other serious issues to attend to, they can be addressed in subsequent PRs, given that this component is disabled by default.

cla-bot bot added the cla-signed label May 7, 2024

github-actions bot added the team/graph and team/product-platform labels May 7, 2024

keynmol added 8 commits May 13, 2024 12:10

WIP: Syntactic enqueuer

e1b62e8

Add observability

ebbd351

Tests for syntactic indexing enqueuer

d120724

Syntactic indexing scheduler and enqueuer

81f16d2

Separate scheduler into a testable interface

5c244af

Allow injecting raw database to aid with testing

656ec9d

Wire in metrics

5f26d6e

WIP

03414ce

keynmol force-pushed the syntactic-indexing-enqueuer branch from 11d5272 to 03414ce May 13, 2024 11:11

keynmol added 4 commits May 13, 2024 12:24

Wire scheduling job and check site config

26a138b

Enqueuer test and refactor test helpers

eb8e860

Refactor config passing to ensure we don't use default values

58102a0

Remove usage of global InitServices

d92c644

It initialises a new database which of course doesn't contain any of the test data

keynmol changed the title ~~Syntactic indexing: enqueuer~~ Syntactic indexing: enqueuer and scheduler May 20, 2024

keynmol added 3 commits May 22, 2024 10:47

Cleanup scheduler bootstrap and move towards working tests

6797dbc

Make policies store a non-internal module to use from tests

c5a2f87

Add more tests to scheduler

07e823f

keynmol commented May 22, 2024

internal/codeintel/syntactic_indexing/scheduler_test.go

keynmol added 2 commits May 23, 2024 09:49

Restore policy store to internal package

bc31f76

Rejig Policy service interface to allow using it in tests

42022f2

Merge branch 'main' into syntactic-indexing-enqueuer

7c4030e

keynmol marked this pull request as ready for review May 23, 2024 11:30

varungandhi-src reviewed May 23, 2024

keynmol added 2 commits May 24, 2024 11:06

PR comments

51f023d

Regenerate Bazel build files

fb7e2f8

Whitespace / naming fixes - avoid codeintelDB & codeIntelDB

dbb0d11

varungandhi-src reviewed May 30, 2024

internal/codeintel/syntactic_indexing/scheduler_job.go

varungandhi-src reviewed May 30, 2024

internal/codeintel/syntactic_indexing/scheduler_config.go

varungandhi-src reviewed May 30, 2024

internal/codeintel/syntactic_indexing/scheduler.go

Reduce blank lines

10d8d97

varungandhi-src reviewed May 30, 2024

internal/codeintel/syntactic_indexing/scheduler.go

varungandhi-src reviewed May 30, 2024

internal/codeintel/syntactic_indexing/scheduler.go

varungandhi-src added 6 commits May 30, 2024 19:22

Rearrange imports

61af571

Rename policies -> policiesSvc for clarity

56812e5

Simplify test code a bit

4d8db78

Whitespace + add defensive check in test

0eb2992

Make whitespace meaningful

37ad21c

Whitespace/renaming

f7579eb

varungandhi-src reviewed May 30, 2024

internal/codeintel/syntactic_indexing/scheduler_test.go

varungandhi-src approved these changes May 30, 2024

keynmol added 8 commits May 30, 2024 14:20

Remove redundant revision checks and extraneous gitserver calls

669bf89

Additionally, harden the types used for repository and commit values

Add clarifying comment in the test

57a6035

Fixup comments

cc953ad

Better names

53a35b6

remove unused import

0a978d3

Remove env fallback

a25480b

Restore changes from bad rebase.

821e175

Remove redundant private function

78c8f0e

keynmol commented May 30, 2024

keynmol added 3 commits May 30, 2024 14:35

Remove gitserver dependency from enqueuer

6080331

Fix tests

e300c49

Add counter for jobs that were skipped

3e932f1

keynmol merged commit 2821447 into main May 31, 2024
11 checks passed

keynmol deleted the syntactic-indexing-enqueuer branch May 31, 2024 09:25
